ggml-cpu: handle 3d tensors in repack mat_mul #17030
Conversation
(force-pushed eadb483 to 950671d)
@ggerganov This is ready for review now. Thanks for your patience.
    const char * src0_ptr = (const char *) src0->data + i02 * nb02;
    const char * src1_ptr = (const char *) params->wdata + (i11 + i12 * ne11) * src1_col_stride;
    char * dst_ptr = ((char *) dst->data + (i1 * nb1 + i2 * nb2));
Add a GGML_ASSERT here that guarantees we are within the bounds of [params->wdata, params->wdata + params->wsize).
Added one GGML_ASSERT for the upper bound. The lower bound is always >= params->wdata: as long as ne1 and ne11 are >= 1, i11 and i12 are non-negative.
(force-pushed 5a202b9 to d1938ad)
@ggerganov I've addressed all your comments. Let me know if something else is required.
@Alcpz Here is llama3.2-1B-Q4_0 running with 6 threads with instrumented matmul code. The instrumentation simply counts the number of processed chunks and the time per thread. That's way too many chunks, and we burn a lot of time on synchronization.
Mmm. Let's revert this then. I will reopen a PR with the branch as a draft and we can have a better solution. I'd rather not introduce a regression upstream. @ggerganov Mind doing the revert? |
This reverts commit 1c398dc.
While testing #16739, perplexities for LFM2 skyrocketed. @ggerganov pointed out that some matrix shapes would probably not be supported.
LFM2 has some layers that have two batches, so MAT_MULs were only done partially, leading to incorrect results. See #16739 (comment)
This patch adds basic support for tensors with ne2 > 1, using very naive chunking based on the non-repack MUL_MAT.

Perplexities using this patch:
I can provide logs for other models if needed.
Repro commands:
Other models: